Data Summarization with Informative Itemsets
Authors
Abstract
Data analysis is an inherently iterative process. That is, what we know about the data greatly determines our expectations, and hence, what result we would find the most interesting. With this in mind, we introduce a well-founded approach for succinctly summarizing data with a collection of informative itemsets; using a probabilistic maximum entropy model, we iteratively find the most interesting itemset, and in turn update our model of the data accordingly. As we only include itemsets that are surprising with regard to the current model, the summary is guaranteed to be both descriptive and non-redundant. The algorithm we present can either mine the top-k most interesting itemsets, or use the Bayesian Information Criterion to automatically identify the model containing only the itemsets most important for describing the data. In other words, it will 'tell you what you need to know'. Experiments on synthetic and benchmark data show that the discovered summaries are succinct, and correctly identify the key patterns in the data. The models they form attain high likelihoods, and inspection shows that they summarize the data well with increasingly specific, yet non-redundant itemsets.

∗ Michael Mampaey is supported by the Agency for Innovation by Science and Technology in Flanders (IWT).
∗∗ Nikolaj Tatti and Jilles Vreeken are supported by Post-Doctoral Fellowships of the Research Foundation - Flanders (FWO).

1 Informative and Succinct Summarization with Itemsets

Knowledge discovery from data is an inherently iterative process. What we already know about the data greatly determines our expectations, and therefore, which results we would find interesting or surprising. Early on in the process of analyzing a database, for instance, we are happy to learn about the generalities underlying the data, while later on we will be more interested in the specifics that build upon these concepts. Essentially, this process comes down to summarization: we want to know what is interesting in the data, and we want this to be reported succinctly and without redundancy.

As natural as it may seem to update a knowledge model during the discovery process, few pattern mining techniques actually follow such a dynamic approach of discovering patterns that are surprising with regard to what we have learned so far. That is, while many techniques provide a series of patterns in order of interestingness, most score these patterns using a static model; during this process the model, and hence the itemset scores, are not updated with the knowledge gained from previously discovered patterns. This static approach gives rise to the typical problem of traditional pattern mining: overwhelmingly large and highly redundant collections of patterns.

Our goal therefore is to discover the set of itemsets that provides the most important information about the data, while containing as little redundancy as possible. To model the data, we construct a maximum entropy distribution that allows us to directly calculate the expected frequencies of itemsets. Then, at each iteration, we return the itemset that provides the most information, i.e., the itemset for which our frequency estimate according to the model was most off. We update our model with this new knowledge, and continue the process. The non-redundant model that contains the most important information is thus automatically identified. Therefore, we paraphrase our method as 'tell me what I need to know'.
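To make the iterative idea above concrete, the following toy sketch performs a single round of "report the itemset whose frequency the model gets most wrong". It is illustrative only: it uses a simple item-independence model as a stand-in for the full maximum entropy model of Section 2, and the dataset, function names, and candidate enumeration are our own simplifications, not the authors' implementation.

```python
from itertools import combinations
from math import prod

# Toy binary dataset: each transaction is a set of items.
D = [
    {"a", "b", "c"}, {"a", "b"}, {"a", "b", "c"}, {"c", "d"},
    {"a", "b", "d"}, {"a", "b", "c", "d"}, {"d"}, {"a", "b"},
]

def freq(itemset, data):
    """Observed (empirical) frequency of an itemset in the data."""
    return sum(itemset <= t for t in data) / len(data)

# Background knowledge: the individual item frequencies only.
items = sorted(set().union(*D))
item_freq = {i: freq({i}, D) for i in items}

def estimate(itemset):
    """Frequency estimate under an item-independence model, standing in
    for the maximum entropy model described in Section 2."""
    return prod(item_freq[i] for i in itemset)

# Candidate itemsets of size two or more.
candidates = [frozenset(c) for n in range(2, len(items) + 1)
              for c in combinations(items, n)]

# One iteration: report the itemset whose observed frequency deviates
# most from what the current model predicts.
most_surprising = max(candidates, key=lambda X: abs(freq(X, D) - estimate(X)))
print(sorted(most_surprising), freq(most_surprising, D), estimate(most_surprising))
```

In the full method, the reported itemset's frequency would then be added as a constraint, the maximum entropy model re-solved, and the loop repeated until the score introduced in Section 2 stops improving.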
2 Identifying the Best Summary

Our objective is to find a succinct summary of a binary dataset: to obtain a small yet high-quality set of itemsets C that describes the key characteristics of the data D at hand, in order to gain useful insights. To model the data, we use the powerful and versatile class of maximum entropy models. These are the probabilistic models identified by the Maximum Entropy principle [1] as the models that make optimal use of the provided information; that is, they rely only on this information and are fully unbiased otherwise. For a collection of itemsets C, we construct the distribution p_C which satisfies all the frequency constraints imposed by C, and maximizes the entropy H(p_C). It can be shown that p_C has a log-linear form; the parameters of this distribution can be found using the Iterative Scaling procedure [2]. While solving and querying the maximum entropy model is infeasible in general, we show that in our setting this can be done efficiently, depending on the amount of overlap between the selected patterns. Our method groups transactions into blocks, according to an equivalence relation induced by C, and employs the Inclusion-Exclusion principle to efficiently calculate probabilities.

To evaluate the quality of a collection of itemsets as a summary for a dataset, we use the Bayesian Information Criterion (BIC) [4], which favors models that fit the data well with few parameters, and is defined as

    s(C) = −log p_C(D) + (1/2) |C| log |D|.

The smaller this score, the better the model. The first term is simply the negative log-likelihood of the model, while the second term is a penalty on the number of parameters, which in our case is the number of itemsets. Consequently, the best model is identified as the one that provides a good balance between high likelihood and low complexity. Moreover, we automatically avoid redundancy, since models with redundant itemsets are penalized for being too complex without sufficiently improving the likelihood.

3 Efficiently Mining the Summary

To mine the summary from the data, we present the MTV algorithm, which mines succinct summaries with Maximally informaTiVe itemsets. The algorithm constructs a summary C by iteratively adding the itemset that provides the most information, i.e., the one that decreases the quality score s(C) the most; the model is then updated to incorporate this new knowledge. We show that this is equivalent to adding the itemset that maximizes the Kullback-Leibler divergence between the two consecutive maximum entropy distributions, and present a computationally efficient heuristic that approximates this divergence, expressing it as the divergence between the itemset's estimated and observed frequency. Since this heuristic is convex, the itemset optimizing it can be mined directly using the branch-and-bound technique proposed by Nijssen et al. [3]. This allows us to mine our collection of itemsets on the fly, rather than taking a two-phase approach in which we would pick them from a larger candidate set that would have to be mined and stored beforehand. The user can also easily infuse background knowledge into the model (in the form of itemset frequencies, e.g., those of the individual items), to avoid discovering patterns that are redundant with regard to what he or she already knows.
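As a rough illustration of the quantities in Sections 2 and 3, the snippet below computes the BIC score from the formula above and a Bernoulli Kullback-Leibler divergence between an itemset's observed and estimated frequency, the kind of divergence the Section 3 heuristic is built on. This is a minimal sketch under our own assumptions: the variable names are ours, natural logarithms are assumed, and the plain Bernoulli KL shown here is not claimed to be the authors' exact scoring function.

```python
from math import log

def bic_score(log_likelihood, num_itemsets, num_transactions):
    """BIC score s(C) = -log p_C(D) + 1/2 * |C| * log |D|; lower is better."""
    return -log_likelihood + 0.5 * num_itemsets * log(num_transactions)

def bernoulli_kl(f_obs, f_est, eps=1e-12):
    """KL divergence between an itemset's observed frequency and the frequency
    the current model estimates for it (both treated as Bernoulli parameters).
    Large values mean the itemset is informative, i.e., surprising."""
    f_obs = min(max(f_obs, eps), 1 - eps)
    f_est = min(max(f_est, eps), 1 - eps)
    return (f_obs * log(f_obs / f_est)
            + (1 - f_obs) * log((1 - f_obs) / (1 - f_est)))

# An itemset observed in 40% of transactions but estimated at 10% is far
# more informative than one estimated at 38%.
print(bernoulli_kl(0.40, 0.10))   # ~0.31, surprising
print(bernoulli_kl(0.40, 0.38))   # ~0.0008, nearly redundant

# Adding an itemset only pays off if the likelihood gain outweighs the
# BIC penalty of (1/2) * log |D| per extra itemset.
print(bic_score(log_likelihood=-5000.0, num_itemsets=10, num_transactions=10000))
print(bic_score(log_likelihood=-4950.0, num_itemsets=11, num_transactions=10000))
```

In the last two lines the extra itemset improves the log-likelihood by 50 while costing only about 4.6 in penalty, so the larger summary obtains the lower (better) score; a redundant itemset that barely changes the likelihood would be rejected.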
Similar Articles
Quantifying the informativeness for biomedical literature summarization: An itemset mining method
OBJECTIVE Automatic text summarization tools can help users in the biomedical domain to access information efficiently from a large volume of scientific literature and other sources of text documents. In this paper, we propose a summarization method that combines itemset mining and domain knowledge to construct a concept-based model and to extract the main subtopics from an input document. Our ...
The Pareto Principle Is Everywhere: Finding Informative Sentences for Opinion Summarization Through Leader Detection
Most previous works on opinion summarization focus on summarizing sentiment polarity distribution towards different aspects of an entity (e.g., battery life and screen of a mobile phone). However, users’ demand may be more beyond this kind of opinion summarization. Besides such coarse-grained summarization on aspects, one may prefer to read detailed but concise text of the opinion data for more...
Summarization - Compressing Data into an Informative Representation
Summarization is an important problem in many domains involving large datasets. Summarization can be essentially viewed as transformation of data into a concise yet meaningful representation which could be used for efficient storage or manual inspection. In this paper, we formulate the problem of summarization of a large dataset of transactions as an optimization problem involving two objective...
A Survey on Automatic Text Summarization
Text summarization endeavors to produce a summary version of a text, while maintaining the original ideas. The textual content on the web, in particular, is growing at an exponential rate. The ability to decipher through such massive amount of data, in order to extract the useful information, is a major undertaking and requires an automatic mechanism to aid with the extant repository of informa...
A New Method for Finding Generalized Frequent Itemsets in Generalized Association Rule Mining
Generalized association rule mining is an extension of traditional association rule mining to discover more informative rules, given a taxonomy. In this paper, we describe a formal framework for the problem of mining generalized association rules. In the framework, the subset-superset and the parent-child relationships among generalized itemsets are introduced to present the different views of ...